{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\"/> \n",
    "    \n",
    "### <center> Author: Maxim Keremet, @maximkeremet\n",
    "    \n",
    "## <center> Tutorial\n",
    "### <center> Webscraping an online retailer assortment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When it comes to retrieving data from a website, you will hear several terms: `parsing`, `(web)scraping` and `crawling`. Let's first see whether these notions actually differ and why people use different terms.<br>\n",
    "\n",
    "After some brief googling, one can conclude that:<br>\n",
    "`Parsing` is extracting information from basically any data source (logs, tables or files)<br>\n",
    "`(Web)scraping` is essentially getting data from a web page<br>\n",
    "`Crawling` is the process of moving around the website<br>\n",
    "\n",
    "So, when somebody talks about retrieving data from a web page, or about recursive data extraction from a whole website, they will probably use one of the words listed above. In reality, these notions are `independent`, but `interrelated` and `sequential` (first you scrape one page, crawl to another, scrape that page, and so on). <br>\n",
    "Now that the terms are defined and understood, it's time to dive into the details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. A powerful webscraping library -  [`Scrapy`](https://scrapy.org/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[`Scrapy`](https://scrapy.org/) is, formally, more than a library: it is a framework, a powerful tool for extracting data from websites and automating the process with only a little code written by the programmer. <br>\n",
    "The strong point of [`Scrapy`](https://scrapy.org/) is that it ships with a set of template `spiders` (programs that go around the targeted locations and search for the needed data), which can be adjusted to a user-specific need with a few lines of Python and bash. <br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I was strolling around the internet before \"Black Friday\", looking for a tempting offer, and thought it would be interesting to get the phone assortment of [Svyaznoy](https://www.svyaznoy.ru), one of the biggest online retailers in Russia. <br>\n",
    "\n",
    "So the ultimate **goal** of this tutorial is to get the **phone name**, its **price**, **a discount** (if there is one) and **a phone photo**, and to store them neatly.<br>\n",
    "So I went to the [Svyaznoy](https://www.svyaznoy.ru) website and looked at the phone assortment.<br>\n",
    "<br>\n",
    "<center>\n",
    "<img src=\"https://habrastorage.org/webt/n8/ko/9z/n8ko9zzuwrpbr7x875w-od8mfoq.png\" />\n",
    "<img src=\"https://habrastorage.org/webt/l3/p2/bd/l3p2bdxwq2nwocopsknagurdmrk.png\" width=700/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And saw that there are **109** pages and roughly 2.5K phone listings, which would be tough and tiring even to look through.<br>\n",
    "\n",
    "_P.S._ We also note the link to the 1st page (which looks weird), because we will need it later on.<br>\n",
    "<br>\n",
    "\n",
    "<center>\n",
    "<img src=\"https://habrastorage.org/webt/hv/na/d3/hvnad3pxrqlmwgwwrnc-2o8th9g.png\" />\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 Creating a crawler."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, install the library if you don't have it.\n",
    "\n",
    "`conda install -c conda-forge scrapy` or `pip install scrapy`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pwd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Scrapy works in terms of projects. So you create a default project with a bunch of scripts that Scrapy runs to get the data from defined locations. <br>\n",
    "Let's create a default project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!scrapy startproject svyaznoy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look inside the project folder that we have just created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cd svyaznoy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is a config file (`scrapy.cfg`) and a folder with Scrapy-specific scripts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we have to create a `spider`, the key program that defines which locations to crawl and which data to collect.<br>\n",
    "For simplicity, we shall call it `svz` (the script will be `svz.py`).<br>\n",
    "We will also pass the web page address, so that Scrapy can fill in the spider template with it.<br>\n",
    "_P.S. `genspider` will not let you give the `spider` the same name as the project._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!scrapy genspider svz https://www.svyaznoy.ru/catalog/phone/224"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that our `spider` has been created in location: `svyaznoy.spiders.svz`<br>\n",
    "Proceed to `svyaznoy` to see what is inside."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cd svyaznoy/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look inside the `spiders` folder and see our script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cd spiders/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look inside the `spider` script."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat svz.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2. Understanding what we want to get from the website. Dealing with developer tools."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We all know that everything in `python` can be regarded as an object. The same applies to a website.<br>\n",
    "Every phone is embedded in a kind of card that holds all its characteristics, such as name, price, discounts, rating and others, and each of them can be regarded as a separate object. A card can be regarded as an object as well, so the set of goods is a set of cards.<br>\n",
    "\n",
    "Here is what I mean by the card of a good:\n",
    "\n",
    "<center>\n",
    "<img src=\"https://habrastorage.org/webt/qs/zl/ns/qszlnsmu8kxjzeziaewsfi8ggok.png\" width=400/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most browsers ship with developer tools, so we can right-click an element and inspect it to understand where the needed objects are located in terms of HTML/XML markup.<br>\n",
    "\n",
    "<center>\n",
    "<img src=\"https://habrastorage.org/webt/do/gr/t0/dogrt0irx2s4pl1e6qduzdimi0a.png\" width=900/>\n",
    "   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we can see the phone price block:\n",
    "<img src=\"https://habrastorage.org/webt/tt/eh/-y/tteh-y3yc4qmozsg5hgdpqnzugq.png\" width=900/>\n",
    "The phone name block:\n",
    "<img src=\"https://habrastorage.org/webt/ww/nn/g1/wwnng1udiblmmah2uoi22zvytms.png\" width=900/>\n",
    "The photo link block:\n",
    "<img src=\"https://habrastorage.org/webt/mn/z0/sb/mnz0sb09l2yfpsld45j17whpaua.png\" width=900/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have to understand what **HTML** and **XML** are, how they differ and how to retrieve information efficiently from a website's markup. <br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.3. Brief understanding of XML and Xpath."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The _key_ difference between the two, and the only one we will need, is that **XML** is used for storing and transporting data, while **HTML** is used for formatting and displaying that data.<br>\n",
    "\n",
    "\n",
    "What is **XML**?\n",
    "\n",
    "* Although XML looks a lot like HTML, it has a completely different purpose and internals. XML stands for `eXtensible Markup Language`, which largely explains itself.<br>\n",
    "\n",
    "* Perhaps surprisingly, XML doesn't really do anything: it just structures, stores and transports data upon request.<br>\n",
    "\n",
    "* One of the reasons it is called eXtensible is that you can invent your own tags, which helps you navigate the data the way you like, while HTML has predefined tags and all HTML documents are based on standardised tags, like `<body>, <p>, <li>` etc.<br>\n",
    "\n",
    "* This lets the developer structure the data the way it fits the nature of the document. However, XML is not a replacement for HTML; it complements it.<br>\n",
    "\n",
    "So, in most web solutions they work in synergy: XML transports the data, and HTML formats and displays it nicely.\n",
    "All this made XML a vital tool for the internet, and it is used wherever data has to be transported between all kinds of applications.<br>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is how **XML code** looks:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
    "<bookstore>\n",
    "   <book category=\"COOKING\">\n",
    "      <title lang=\"en\">Everyday Italian</title>\n",
    "      <author>Giada De Laurentiis</author>\n",
    "      <year>2005</year>\n",
    "      <price>30.00</price>\n",
    "   </book>\n",
    "   <book category=\"CHILDREN\">\n",
    "      <title lang=\"en\">Harry Potter</title>\n",
    "      <author>J K. Rowling</author>\n",
    "      <year>2005</year>\n",
    "      <price>29.99</price>\n",
    "   </book>\n",
    "   <book category=\"WEB\">\n",
    "      <title lang=\"en\">XQuery Kick Start</title>\n",
    "      <author>James McGovern</author>\n",
    "      <author>Per Bothner</author>\n",
    "      <author>Kurt Cagle</author>\n",
    "      <author>James Linn</author>\n",
    "      <author>Vaidyanathan Nagarajan</author>\n",
    "      <year>2003</year>\n",
    "      <price>49.99</price>\n",
    "   </book>\n",
    "   <book category=\"WEB\">\n",
    "      <title lang=\"en\">Learning XML</title>\n",
    "      <author>Erik T. Ray</author>\n",
    "      <year>2003</year>\n",
    "      <price>39.95</price>\n",
    "   </book>\n",
    "</bookstore>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or it can also be represented in a tree-form, which can be easier to grasp.\n",
    "<img src=\"https://habrastorage.org/webt/pp/x7/0c/ppx70czyee88a9wjyqm-3s7kw0w.gif\" width=500/>\n",
    "<br> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is **XPath**?<br>\n",
    "\n",
    "* XPath is a special language to identify parts of XML documents, search and select information.<br>\n",
    "\n",
    "* It uses path expressions to navigate, that look a lot like queries.<br>\n",
    "\n",
    "* It also has a list of functions (logical and numerical) to test the data.<br>\n",
    "\n",
    "\n",
    "And this is what the queries look like. With their help we can drill into the **XML** notation with **XPath** down to the data elements, using hierarchical selectors:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "|XPath expression | Result   |\n",
    "|------|------|\n",
    "|   **/bookstore/book[1]**   |Selects the first book element that is the child of the bookstore element |\n",
    "|**/bookstore/book[last( )]**|Selects the last book element that is the child of the bookstore element|\n",
    "|**/bookstore/book[last( )-1]**|Selects the last but one book element that is the child of the bookstore element|\n",
    "|**/bookstore/book[position( )<3]**|Selects the first two book elements that are children of the bookstore element|\n",
    "|**//title[@lang]**|Selects all the title elements that have an attribute named lang|\n",
    "|**//title[@lang='en']**|Selects all the title elements that have a \"lang\" attribute with a value of \"en\"|\n",
    "|**/bookstore/book[price>35.00]**|Selects all the book elements of the bookstore element that have a price element with a value greater than 35.00|\n",
    "|**/bookstore/book[price>35.00]/title**|Selects all the title elements of the book elements of the bookstore element that have a price element with a value greater than 35.00|"
   ]
  },
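  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A few of the expressions from the table can be tried without leaving Python. The sketch below is only an illustration: it uses the standard-library `xml.etree.ElementTree`, which understands a limited subset of XPath (positions, `last()` and attribute tests, but not comparisons such as `price>35.00`), so that last filter is done in plain Python. The shortened `BOOKSTORE_XML` string and the variable names are made up for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# A shortened copy of the bookstore document shown above\n",
    "BOOKSTORE_XML = \"\"\"\n",
    "<bookstore>\n",
    "  <book category=\"COOKING\"><title lang=\"en\">Everyday Italian</title><price>30.00</price></book>\n",
    "  <book category=\"CHILDREN\"><title lang=\"en\">Harry Potter</title><price>29.99</price></book>\n",
    "  <book category=\"WEB\"><title lang=\"en\">XQuery Kick Start</title><price>49.99</price></book>\n",
    "  <book category=\"WEB\"><title lang=\"en\">Learning XML</title><price>39.95</price></book>\n",
    "</bookstore>\n",
    "\"\"\"\n",
    "\n",
    "root = ET.fromstring(BOOKSTORE_XML.strip())  # root is the <bookstore> element\n",
    "\n",
    "# /bookstore/book[1] and /bookstore/book[last()], written relative to the root\n",
    "first_title = root.find(\"./book[1]/title\").text\n",
    "last_title = root.find(\"./book[last()]/title\").text\n",
    "\n",
    "# //title[@lang='en']\n",
    "english_titles = [title.text for title in root.findall(\".//title[@lang='en']\")]\n",
    "\n",
    "# /bookstore/book[price>35.00]/title, with the comparison done in Python\n",
    "expensive_titles = [\n",
    "    book.find(\"title\").text\n",
    "    for book in root.findall(\"./book\")\n",
    "    if float(book.find(\"price\").text) > 35.00\n",
    "]\n",
    "\n",
    "print(first_title, last_title, english_titles, expensive_titles, sep=\"\\n\")"
   ]
  },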
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For now, it is sufficient to know that the textual data on the page sits in a tree of tags, much like **XML**, and that we can efficiently retrieve it by querying with **XPath**. <br>\n",
    "**XPath**, in its turn, is a query language based on the hierarchical tag structure, which forms a tree that can be easily decomposed, selected from and manipulated for our purposes.<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Getting to know the website through the terminal."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Scrapy` has an interactive shell where you can debug your scraping code very quickly and try out selecting data without running a spider every time.<br>\n",
    "_Try it yourself!_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scrapy shell # this will start the shell"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fetch(\"https://www.svyaznoy.ru/catalog/phone/224\") # get the structure of the web page"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(response.text) # bring back the bare html and css, like in developer tools"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1. Getting phone names."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since you cannot execute everything in `Jupyter notebooks` (unfortunately), we test/debug via `scrapy shell` in a terminal, command line or other console app, and use `XPath` notation to drill into the tag tree and get the needed data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Titles can be found like so:<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response.xpath(\"//div[@class='b-product-block__image']//img[@class='lazy']/@title\").extract() "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write them to an object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "titles = response.xpath(\"//div[@class='b-product-block__image']//img[@class='lazy']/@title\").extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2. Getting photos."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our photo links are also located in the `b-product-block__image` class and we can extract them in the following way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "imgs = response.xpath(\"//div[@class='b-product-block__image']//img[@class='lazy']/@data-original\").extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3. Getting prices."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Prices are located a bit further down the tree, in the `b-product-block__price` block, inside a span of class `b-product-block__visible-price`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response.xpath(\".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()\").extract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, the raw data is dirty and we will have to clean it up, using some time, magic and regular expressions (which are actually equivalent to magic)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is how we can clear it up:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prices = [price.replace(\"\\n\", \"\") for price in response.xpath(\".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()\").extract()]\n",
    "prices = [price.replace(\"\\xa0\", \"\") for price in prices]  # cleaning from non-breaking space in Latin1(ISO 8859-1)\n",
    "prices = [price.strip() for price in prices] # cleaning from unwanted spaces \n",
    "prices = [int(price) for price in prices if price] # turning string objects to integers"
   ]
  },
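  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same cleaning steps can be wrapped in a small helper, which makes them easy to try outside the `scrapy shell`. This is just a sketch: `clean_price` and the sample strings in `raw_prices` are invented for illustration and only imitate the kind of text the selector returns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "\n",
    "def clean_price(raw):\n",
    "    \"\"\"Return the digits of a raw price string as an int, or None if there are none.\"\"\"\n",
    "    digits = re.sub(r\"\\D\", \"\", raw)  # drops newlines, spaces and non-breaking spaces at once\n",
    "    return int(digits) if digits else None\n",
    "\n",
    "\n",
    "raw_prices = [\"\\n  12\\xa0990\\xa0\", \"\\n\", \" 5\\xa0490 \"]  # made-up sample input\n",
    "cleaned = [clean_price(raw) for raw in raw_prices]\n",
    "prices = [price for price in cleaned if price is not None]  # skip empty text nodes\n",
    "print(prices)"
   ]
  },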
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4. Getting sale offers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is pretty much the same as with prices, but there are cases when there is no sale offer for an item, so we have to be a bit clever here.<br>\n",
    "Below is a list comprehension that gets the sale offer. For each price block we test whether it contains a sale offer: if it does, we extract it, otherwise we fill the slot with a placeholder (an empty string here, the string `'0'` in the final version)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Look inside each price block separately, so that every card gets its own offer\n",
    "# (searching the whole response would always return the first offer on the page).\n",
    "[block.xpath(\".//div[@class='b-product-block__gain']\").extract_first(default='')\n",
    " for block in response.xpath(\"//div[@class='b-product-block__price']\")]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# one entry per price block: the offer's HTML, or '0' when the card has no offer\n",
    "sales = [block.xpath(\".//div[@class='b-product-block__gain']\").extract_first(default='0') for block in response.xpath(\"//div[@class='b-product-block__price']\")]\n",
    "sales = [sale.replace(\"\\xa0\", \"\") for sale in sales]  # cleaning from non-breaking space in Latin1(ISO 8859-1)\n",
    "sales = [sale.strip() for sale in sales] # cleaning from unwanted spaces \n",
    "sales = [re.findall(r\"\\d+\", sale) for sale in sales] # finding all runs of digits in each object\n",
    "sales = [item for sublist in sales for item in sublist] # flatten the list of lists \n",
    "sales = [int(sale) for sale in sales] # turning string objects to integers"
   ]
  },
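  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A sale offer belongs to one card, so it is safest to look it up inside that card's price block. The sketch below shows the idea on a tiny hand-written snippet. The markup is hypothetical: it only reuses the class names from this section, and it is parsed with the standard-library `xml.etree.ElementTree` instead of Scrapy selectors, so that it runs anywhere."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# Hypothetical, simplified markup: two cards, only the first one has a sale offer\n",
    "PAGE = \"\"\"\n",
    "<div>\n",
    "  <div class=\"b-product-block__price\">\n",
    "    <span class=\"b-product-block__visible-price\">12 990</span>\n",
    "    <div class=\"b-product-block__gain\">-1 000</div>\n",
    "  </div>\n",
    "  <div class=\"b-product-block__price\">\n",
    "    <span class=\"b-product-block__visible-price\">5 490</span>\n",
    "  </div>\n",
    "</div>\n",
    "\"\"\"\n",
    "\n",
    "root = ET.fromstring(PAGE.strip())\n",
    "\n",
    "sales = []\n",
    "for block in root.findall(\".//div[@class='b-product-block__price']\"):\n",
    "    # search only inside this card's price block\n",
    "    gain = block.find(\"./div[@class='b-product-block__gain']\")\n",
    "    text = gain.text if gain is not None else \"0\"  # '0' stands for 'no offer'\n",
    "    sales.append(int(\"\".join(re.findall(r\"\\d+\", text))))\n",
    "\n",
    "print(sales)"
   ]
  },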
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Compiling it all together."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So we put all things together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Retrieving objects\n",
    "imgs = response.xpath(\"//div[@class='b-product-block__image']//img[@class='lazy']/@data-original\").extract()\n",
    "titles = response.xpath(\"//div[@class='b-product-block__image']//img[@class='lazy']/@title\").extract()\n",
    "prices = response.xpath(\".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()\").extract()\n",
    "# one entry per price block: the offer's HTML, or '0' when the card has no offer\n",
    "sales = [block.xpath(\".//div[@class='b-product-block__gain']\").extract_first(default='0') for block in response.xpath(\"//div[@class='b-product-block__price']\")]\n",
    "\n",
    "# Process the prices\n",
    "prices = [price.replace(\"\\n\", \"\") for price in prices]\n",
    "prices = [price.replace(\"\\xa0\", \"\") for price in prices]\n",
    "prices = [price.strip() for price in prices]\n",
    "prices = [int(price) for price in prices if price]\n",
    "\n",
    "# Process the discounts\n",
    "sales = [sale.replace(\"\\xa0\", \"\") for sale in sales] \n",
    "sales = [sale.strip() for sale in sales] \n",
    "sales = [re.findall(r\"\\d+\", sale) for sale in sales]\n",
    "sales = [item for sublist in sales for item in sublist] \n",
    "sales = [int(sale) for sale in sales] "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we want the data in a familiar format. First the data will be stored in dictionaries, and then we will pass them to the internal `Scrapy` scripts, so that we get a table in `.csv` format.<br>\n",
    "\n",
    "We will use a `for` loop and the `zip` construction.<br>\n",
    "We will also use [`yield`](https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do), which turns the method into a generator, so that the spider emits one dictionary per phone.<br>\n",
    "All together it looks like this:<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fragment of the spider's parse() method (`yield` only works inside a function)\n",
    "for item in zip(titles,prices,sales,imgs):\n",
    "            scraped_info = {\n",
    "                'title' : item[0],\n",
    "                'price' : item[1],\n",
    "                'sale_offer': item[2],  # the offer of this very item, not sales[2]\n",
    "                'image_urls' : [item[3]]}\n",
    "            yield scraped_info"
   ]
  },
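  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Outside a spider, the same pattern can be tried as an ordinary generator function. This is a minimal sketch with made-up sample data; `build_items` is a hypothetical helper, not part of Scrapy. Note that `zip` stops at the shortest list, so the four lists have to stay aligned, otherwise items are silently dropped."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_items(titles, prices, sales, imgs):\n",
    "    \"\"\"Yield one dictionary per phone, pairing the lists up position by position.\"\"\"\n",
    "    for title, price, sale, img in zip(titles, prices, sales, imgs):\n",
    "        yield {\n",
    "            'title': title,\n",
    "            'price': price,\n",
    "            'sale_offer': sale,\n",
    "            'image_urls': [img],  # a list, as in the tutorial's dictionary\n",
    "        }\n",
    "\n",
    "\n",
    "# made-up sample data\n",
    "items = list(build_items(\n",
    "    ['Phone A', 'Phone B'],\n",
    "    [12990, 5490],\n",
    "    [1000, 0],\n",
    "    ['http://example.com/a.jpg', 'http://example.com/b.jpg'],\n",
    "))\n",
    "print(items[0])"
   ]
  },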
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Pagination. How to iterate through all pages."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Scrapy` has built-in structures for extracting page links and defining crawl rules, but I decided to keep it very simple: make a list of links with a list comprehension and merge it with the first page. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pagination. \n",
    "allowed_domains = ['svyaznoy.ru']  # domain names only, without scheme or path\n",
    "first_page = ['http://www.svyaznoy.ru/catalog/phone/224/']\n",
    "all_others = ['http://www.svyaznoy.ru/catalog/phone/224/page-'+str(x) for x in range(2,110)]  # pages 2..109\n",
    "\n",
    "# Put the 1st page in front.\n",
    "start_urls = first_page + all_others"
   ]
  },
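  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Off-by-one mistakes are easy to make with `range`, so it can help to build the list in a small function and check its length. A sketch only: `page_urls` and `BASE_URL` are hypothetical names, and the `page-N` pattern is simply the one used in the list above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "BASE_URL = 'http://www.svyaznoy.ru/catalog/phone/224/'\n",
    "\n",
    "\n",
    "def page_urls(base_url, last_page):\n",
    "    \"\"\"Return the first page followed by page-2 ... page-<last_page>.\"\"\"\n",
    "    # range() excludes its upper bound, hence last_page + 1\n",
    "    return [base_url] + [base_url + 'page-' + str(n) for n in range(2, last_page + 1)]\n",
    "\n",
    "\n",
    "start_urls = page_urls(BASE_URL, 109)\n",
    "print(len(start_urls), start_urls[-1])"
   ]
  },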
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apart from that, we also have to tell our `spider` to request the next page after each response."
   ]
  },
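  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fragment of the spider's parse() method\n",
    "next_page = response.xpath(\".//li[@class='next']//a/@data-page\").extract_first()\n",
    "if next_page is not None:  # there is no 'next' link on the last page\n",
    "    # assumption: the same 'page-N' URL pattern as in the list of pages\n",
    "    next_page = response.urljoin('page-' + str(int(next_page) + 1))\n",
    "    yield scrapy.Request(next_page, callback=self.parse)"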
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Some finishing touches."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition, we have to include several adjustments in our scripts.\n",
    "Firstly, we have to specify where we would like to store our results, in our main script `svz.py`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "custom_settings = {'FEED_URI' : 'results/svyaznoy.csv'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Secondly, we will have to look in our project folder `svyaznoy` for a `settings.py` script and include some parameters, listed below. <br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "BOT_NAME = 'svyaznoy'\n",
    "\n",
    "SPIDER_MODULES = ['svyaznoy.spiders']  # <project package>.spiders\n",
    "NEWSPIDER_MODULE = 'svyaznoy.spiders'\n",
    "FEED_FORMAT = \"csv\"\n",
    "FEED_URI = \"svyaznoy.csv\""
   ]
  },
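  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A note for newer installations: as far as I know, since Scrapy 2.1 the `FEED_FORMAT` and `FEED_URI` settings are deprecated in favour of a single `FEEDS` dictionary. Assuming such a version, the equivalent of the two feed lines would look roughly like this (please check the feed exports documentation of your Scrapy version before relying on it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# settings.py, assuming Scrapy >= 2.1\n",
    "FEEDS = {\n",
    "    'svyaznoy.csv': {'format': 'csv'},  # output path -> feed options\n",
    "}"
   ]
  },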
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And lastly, we need to specify the pipeline that downloads the phone photos from the extracted links (include this in `settings.py`; the images pipeline requires the `Pillow` package)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ITEM_PIPELINES = {\n",
    "  'scrapy.pipelines.images.ImagesPipeline': 1\n",
    "}\n",
    "IMAGES_STORE = 'results/images/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Done!** <br>\n",
    "Other scripts in the `spiders` folder do not need any adjustments, at least for our purposes. <br>\n",
    "Go to the project folder (`svyaznoy`), make some popcorn and run the spider with the command `scrapy crawl svyaznoy`. Enjoy.<br>\n",
    "\n",
    "<img src=\"https://habrastorage.org/webt/st/p6/pp/stp6pp5g_bnqusowlvk_pujsisw.gif\" width=500/>\n",
    "<br> \n",
    "\n",
    "After a couple of minutes we will get a report with stats about the job done. Something like this:<br>\n",
    "<img src=\"https://habrastorage.org/webt/hx/y7/7p/hxy77pebbjsb_dclzdqe0h5bqfs.png\" width=500/><br>\n",
    "\n",
    "We can see that the majority of the data was scraped, while some requests returned a 301 code, which means a redirect.\n",
    "Of course there are tips and tricks for dealing with that as well, but I will leave it to the reader to find them in the documentation.<br> \n",
    "\n",
    "You can also notice small resized photos and a csv file available in the `results` folder.<br>\n",
    "\n",
    "<img src=\"https://habrastorage.org/webt/z8/fs/1x/z8fs1x8wxby-yyz3sqkae4b_ycg.png\" width=250/><br> \n",
    "<img src=\"https://habrastorage.org/webt/8u/io/ha/8uiohacxmbxik2hmhvbdiq0de0u.png\" width=1000/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. All together. A full script."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The script below can simply be copy-pasted (_\"Ctrl+C/Ctrl+V\"_) into the key spider/crawler script, `svz.py`. Don't forget the additional adjustments in `settings.py` <br>\n",
    "\n",
    "**_P.S._** Bear in mind the indentation problem (4 spaces or Tab) when writing/debugging code in text editors/IDEs. <br>\n",
    "In my case, I chose Tabs.<br>\n",
    "<img src=\"https://i.ytimg.com/vi/V7PLxL8jIl8/hqdefault.jpg\"> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "import scrapy\n",
    "\n",
    "\n",
    "class SvzSpider(scrapy.Spider):\n",
    "\n",
    "\n",
    "\tcustom_settings = {'FEED_URI' : 'results/svyaznoy.csv'}\n",
    "\n",
    "\tname = 'svyaznoy'\n",
    "\n",
    "\t# Making a proper list of pages.\n",
    "\tallowed_domains = ['svyaznoy.ru']  # domain names only, without scheme or path\n",
    "\tfirst_page = ['http://www.svyaznoy.ru/catalog/phone/224/']\n",
    "\tall_others = ['http://www.svyaznoy.ru/catalog/phone/224/page-'+str(x) for x in range(2,110)]  # pages 2..109\n",
    "\n",
    "\t# Putting the 1st page in front.\n",
    "\tstart_urls = first_page+all_others\n",
    "\n",
    "\n",
    "\tdef parse(self, response):\n",
    "\n",
    "\t\t# Retrieving objects\n",
    "\t\timgs = response.xpath(\"//div[@class='b-product-block__image']//img[@class='lazy']/@data-original\").extract()\n",
    "\t\ttitles = response.xpath(\"//div[@class='b-product-block__image']//img[@class='lazy']/@title\").extract()\n",
    "\t\tprices = response.xpath(\".//div[@class='b-product-block__price']//span[@class='b-product-block__visible-price']/text()\").extract()\n",
    "\t\t# one entry per price block: the offer's HTML, or '0' when the card has no offer\n",
    "\t\tsales = [block.xpath(\".//div[@class='b-product-block__gain']\").extract_first(default='0') for block in response.xpath(\"//div[@class='b-product-block__price']\")]\n",
    "\n",
    "\t\t# Processing prices\n",
    "\t\tprices = [price.replace(\"\\n\", \"\") for price in prices]\n",
    "\t\tprices = [price.replace(\"\\xa0\", \"\") for price in prices]\n",
    "\t\tprices = [price.strip() for price in prices]\n",
    "\t\tprices = [int(price) for price in prices if price]\n",
    "\n",
    "\t\t# Processing sale offers\n",
    "\t\tsales = [sale.replace(\"\\xa0\", \"\") for sale in sales]\n",
    "\t\tsales = [sale.strip() for sale in sales]\n",
    "\t\tsales = [re.findall(r\"\\d+\", sale) for sale in sales]\n",
    "\t\tsales = [item for sublist in sales for item in sublist]\n",
    "\t\tsales = [int(sale) for sale in sales]\n",
    "\n",
    "\t\t# Yielding objects\n",
    "\t\tfor item in zip(titles,prices,sales,imgs):\n",
    "\t\t\tscraped_info = {\n",
    "\t\t\t\t'title' : item[0],\n",
    "\t\t\t\t'price' : item[1],\n",
    "\t\t\t\t'sale_offer': item[2],  # the offer of this very item, not sales[2]\n",
    "\t\t\t\t'image_urls' : [item[3]]}\n",
    "\t\t\tyield scraped_info\n",
    "\n",
    "\t\t# Pagination loop\n",
    "\t\tnext_page = response.xpath(\".//li[@class='next']//a/@data-page\").extract_first()\n",
    "\t\tif next_page is not None:  # there is no 'next' link on the last page\n",
    "\t\t\t# assumption: the same 'page-N' URL pattern as in start_urls\n",
    "\t\t\tnext_page = response.urljoin('page-' + str(int(next_page) + 1))\n",
    "\t\t\tyield scrapy.Request(next_page, callback=self.parse)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  7. Resources and useful links.\n",
    "\n",
    "1. [Scrapy documentation](https://doc.scrapy.org/en/latest/intro/overview.html)<br>\n",
    "2. [XML and XPath basics](https://www.w3schools.com/xml/xml_whatis.asp)<br>\n",
    "3. [Xpath selectors](https://blog.michaelyin.info/scrapy-tutorial-7-how-use-xpath-scrapy)<br>\n",
    "4. [Tutorial #1](https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/)<br>\n",
    "5. [Tutorial #2](https://blog.michaelyin.info/scrapy-tutorial-7-how-use-xpath-scrapy/)<br>\n",
    "6. [Pagination in a script](http://qaru.site/questions/247879/scrapy-parsing-items-that-are-paginated)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
